Automatic Genre-Specific Text Classification
Authors
Abstract
INTRODUCTION
Starting with a vast number of unstructured or semi-structured documents, text mining tools analyze and sift through them to present users with more valuable information specific to their information needs. The technologies in text mining include, among others, information extraction. In this chapter, we share our hands-on experience with one specific text mining task: text classification [Sebastiani, 2002].
Information occurs in various formats, and some formats have a specific structure or contain specific kinds of information; we refer to these as "genres". Examples of information genres include news items, reports, and academic articles. In this chapter, we deal with one specific genre, the course syllabus, whose commonly occurring fields include title, description, instructor's name, textbook details, and class schedule. In essence, a course syllabus is the skeleton of a course.
Free and fast access to a collection of syllabi in a structured format could have a significant impact on education, especially for educators and lifelong learners. Educators can borrow ideas from others' syllabi to organize their own classes. Lifelong learners could also easily find popular textbooks, and even important chapters, when they want to study a course on their own.
Unfortunately, searching for a syllabus on the Web using the Information Retrieval [Baeza-Yates & Ribeiro-Neto, 1999] techniques employed by a generic search engine often yields too many non-relevant result pages (i.e., noise): some only provide guidelines on syllabus creation, some only provide a schedule for a course event, and some merely link out to syllabi (e.g., a course list page of an academic department). Therefore, a well-designed classifier for the search results is needed, one that helps not only to filter out noise but also to identify more relevant and useful syllabi.
This chapter presents our work on automatically recognizing syllabus pages through text classification in order to build a syllabus collection. We discuss issues related to the selection of appropriate features, as well as classifier model construction using both generative models (Naïve Bayes, NB) and discriminative models (Support Vector Machines, SVM). Our results show that SVM outperforms NB in recognizing true syllabi.
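To make the generative side of this comparison concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. It is not the authors' implementation: the bag-of-words features, the four toy training pages, the label names `syllabus` and `noise`, and the helper names `tokenize`, `train_nb`, and `predict_nb` are all assumptions made for illustration. An SVM counterpart would normally come from a library and is not shown here.

```python
import math
from collections import Counter

# Hypothetical toy data; the chapter's real collection and features are not shown here.
TRAIN = [
    ("course syllabus instructor textbook schedule grading", "syllabus"),
    ("syllabus course description instructor office hours textbook", "syllabus"),
    ("guidelines for writing a good syllabus", "noise"),
    ("department course list links", "noise"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumed, very simple feature extractor)."""
    return text.lower().split()


def train_nb(docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: fraction of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        # Smoothed denominator: class token count plus vocabulary size.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
print(predict_nb(model, "instructor textbook schedule"))
print(predict_nb(model, "department list links"))
```

With so few training pages the sketch only shows the mechanics (class priors, smoothed word likelihoods, argmax over classes). It says nothing about the relative accuracy of NB and SVM reported in the chapter.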
Similar resources
Retrieval Models for Genre Classification
Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means for genre classification is an open and controversially discussed issue. This paper gives an overview and presents new results related to auto...
Automatic Genre Classification for Resource Scarce Languages
In this article we present research on the development of automatic genre classification systems for resource scarce languages. The main approaches to text classification from literature are presented and weighed against each other during an experimental phase, to identify the most appropriate text classification approach to be used as a genre classification system. A fixed feature set is extra...
ARF @ MediaEval 2012: Multimodal Video Classification
In this paper we study the integration of various audio, visual and text-based descriptors for automatic video genre classification. Experimental validation is conducted on 26 video genres specific to web media platforms (e.g. blip.tv).
Automatic Genre Identification: Towards a Flexible Classification Scheme
This paper presents an automatic genre classification model that implements a flexible classification scheme, i.e. a scheme capable of performing zero-, one- or multi-genre assignment. I suggest that this scheme is more appropriate for genres on the web, because many web pages often have more than one genre or none at all. The model that I propose relies on the distinction between the concepts of...
Automatic Metrics for Genre-specific Text Quality
To date, researchers have proposed different ways to compute the readability and coherence of a text using a variety of lexical, syntax, entity and discourse properties. But these metrics have not been defined with special relevance to any particular genre but rather proposed as general indicators of writing quality. In this thesis, we propose and evaluate novel text quality metrics that utiliz...